A Reinforcement Learning Method for Maximizing Undiscounted Rewards
نویسنده
چکیده
While most Reinforcement Learning work utilizes temporal discounting to evaluate performance, the reasons for this are unclear. Is it out of desire or necessity? We argue that it is not out of desire, and seek to dispel the notion that temporal discounting is necessary by proposing a framework for undiscounted optimization. We present a metric of undiscounted performance and an algorithm for finding action policies that maximize that measure. The technique, which we call Rlearning, is modelled after the popular Q-learning algorithm [17]. Initial experimental results are presented which attest to a great improvement over Q-learning in some simple cases.
منابع مشابه
Online Regret Bounds for Undiscounted Continuous Reinforcement Learning
We derive sublinear regret bounds for undiscounted reinforcement learning in continuous state space. The proposed algorithm combines state aggregation with the use of upper confidence bounds for implementing optimism in the face of uncertainty. Beside the existence of an optimal policy which satisfies the Poisson equation, the only assumptions made are Hölder continuity of rewards and transitio...
متن کاملAn Analysis of Direct Reinforcement Learning in Non-Markovian Domains
It is well known that for Markov decision processes, the policies stable under policy iteration and the standard reinforcement learning methods are exactly the optimal policies. In this paper, we investigate the conditions for policy stability in the more general situation when the Markov property cannot be assumed. We show that for a general class of non-Markov decision processes, if actual re...
متن کاملAn Analysis of non-Markov Automata Games: Implications for Reinforcement Learning
It has previously been established that for Markov learning automata games, the game equilibria are exactly the optimal strategies (Witten, 1977; Wheeler & Narendra, 1986). In this paper, we extend the game theoretic view of reinforcement learning to consider the implications for \group rationality" (Wheeler & Narendra, 1986) in the more general situation of learning when the Markov property ca...
متن کاملAn Average - Reward Reinforcement Learning
Recently, there has been growing interest in average-reward reinforcement learning (ARL), an undiscounted optimality framework that is applicable to many diierent control tasks. ARL seeks to compute gain-optimal control policies that maximize the expected payoo per step. However, gain-optimality has some intrinsic limitations as an optimality criterion, since for example, it cannot distinguish ...
متن کاملImproved Regret Bounds for Undiscounted Continuous Reinforcement Learning
We consider the problem of undiscounted reinforcement learning in continuous state space. Regret bounds in this setting usually hold under various assumptions on the structure of the reward and transition function. Under the assumption that the rewards and transition probabilities are Lipschitz, for 1-dimensional state space a regret bound of Õ(T 3 4 ) after any T steps has been given by Ortner...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 1993